This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted starting in April 2020. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.
In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every Tuesday morning we combine the most recent forecasts from each team into a single “ensemble” forecast for each forecast target. This is used as the official ensemble forecast of the CDC, typically appearing on their forecasting website on Wednesday.
The LeaderBoard table included below evaluates models based on their prediction interval coverage (50% and 95% coverage), their adjusted relative weighted interval score (WIS), and their adjusted relative mean absolute error (MAE), over both a recent and a historical period.
The prediction interval coverage measures the proportion of times a prediction interval of a certain level covered the true value, and assesses the degree to which forecasts accurately characterized uncertainty about future observations. Well-calibrated models should have a 50% coverage level of 0.5 and a 95% coverage level of 0.95.
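As a concrete illustration, empirical coverage can be computed directly from the submitted quantiles. The sketch below is a minimal Python example, not the Hub's evaluation code; the array names are hypothetical.

```python
import numpy as np

def empirical_coverage(lower, upper, truth):
    """Proportion of prediction intervals that contained the observed value.

    `lower` and `upper` are the interval endpoints (e.g. the 0.025 and
    0.975 quantiles for a 95% interval); `truth` holds the values
    observed later. All three are hypothetical arrays of equal length.
    """
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return float(np.mean((lower <= truth) & (truth <= upper)))
```

A well-calibrated model should return roughly 0.50 for its 50% intervals and 0.95 for its 95% intervals.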
The weighted interval score (WIS) is a proper score that combines a set of interval scores for probabilistic forecasts that provide quantiles of the predictive distribution. To account for variation in the difficulty of forecasting different weeks and locations, a pairwise comparison approach was used to calculate the adjusted relative WIS. The code for this comparison can be found here. A preprint on this method for calculating the WIS can be found here. Models with an adjusted relative WIS lower than 1 are more accurate than the baseline, and models with an adjusted relative WIS greater than 1 are less accurate than the baseline, at predicting the number of incident cases.
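For reference, the sketch below shows one way to compute the WIS from a median and a set of central prediction intervals, following the weighting scheme described in the preprint (weight 1/2 on the median's absolute error and α/2 on each interval score). It is an illustrative Python sketch, not the Hub's evaluation code.

```python
def interval_score(lower, upper, alpha, y):
    """Interval score for a central (1 - alpha) prediction interval:
    interval width plus penalties scaled by 2/alpha when y falls outside."""
    return (upper - lower) \
        + (2 / alpha) * max(lower - y, 0) \
        + (2 / alpha) * max(y - upper, 0)

def weighted_interval_score(median, lowers, uppers, alphas, y):
    """WIS: weighted sum of the median's absolute error (weight 1/2) and
    K interval scores (weight alpha_k / 2 each), divided by K + 1/2."""
    total = 0.5 * abs(y - median)
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2) * interval_score(lo, up, a, y)
    return total / (len(alphas) + 0.5)
```

Lower WIS values indicate sharper, better-centered forecasts.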
The mean absolute error (MAE) is defined as the mean of the absolute differences between the forecasted values and the actual values. For this calculation, point forecasts submitted by teams were used. When point forecasts were not available, the 0.50 quantile was used. As with the WIS, a pairwise comparison approach was used to account for variation across locations and weeks. Models with an adjusted relative MAE lower than 1 are more accurate than the baseline, and models with an adjusted relative MAE greater than 1 are less accurate than the baseline, at predicting the number of incident cases.
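The pairwise comparison used for both the adjusted relative WIS and the adjusted relative MAE can be sketched as follows: for every pair of models, take the ratio of their mean scores over the targets both forecasted, summarize each model by the geometric mean of its ratios, and rescale so the baseline sits at exactly 1. The Python below is a minimal sketch under that reading; the layout of the `scores` DataFrame is an assumption, not the Hub's actual data structure.

```python
import numpy as np
import pandas as pd

def pairwise_relative_skill(scores: pd.DataFrame, baseline: str) -> pd.Series:
    """Adjusted relative skill via pairwise comparisons.

    `scores` is assumed to be indexed by forecast target (location,
    week, horizon) with one column of WIS or MAE values per model;
    NaN marks targets a model did not forecast.
    """
    models = scores.columns
    theta = pd.DataFrame(index=models, columns=models, dtype=float)
    for i in models:
        for j in models:
            shared = scores[[i, j]].dropna()  # targets both models scored
            theta.loc[i, j] = shared[i].mean() / shared[j].mean()
    # Geometric mean of each model's ratios, rescaled so that the
    # baseline model's adjusted relative skill is exactly 1.
    rel = np.exp(np.log(theta).mean(axis=1))
    return rel / rel[baseline]
```

Restricting each ratio to the targets both models scored is what adjusts for models forecasting weeks and locations of differing difficulty.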
Inclusion criteria for each column are detailed below the table.
In order to calculate each column in our table, different inclusion criteria were applied.
The first column in the table lists all models that have contributed forecasts for 5 or more weeks in total since the beginning of April, or models that have submitted forecasts during at least 2 of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts.
The next column lists the number of forecasts a team has submitted with a target end date within the most recent 10-week period.
Columns 3 and 4 show the adjusted relative WIS and the adjusted relative MAE over the most recent 10-week period. For inclusion in these columns, a model must have submitted forecasts for 50% or more of the evaluated targets in the most recent evaluation period.
The inclusion criteria for the calibration coverage columns are the same as for the first two columns.
Column 7 shows the number of forecasts a team has submitted over the historical period. All teams that have submitted at least 5 forecasts in total, or forecasts in at least 2 of the last 3 weeks, are included in this count.
Columns 8 and 9 show the adjusted relative WIS and the adjusted relative MAE over a historical period beginning the first week in March. For inclusion in these columns, a model must have predictions for 50% or more of the evaluated targets in the historical evaluation period.
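To make the listing criterion concrete, the sketch below filters a hypothetical submission log down to the models that qualify for the table (5 or more submission weeks overall, or at least 2 of the last 3 evaluated weeks). The column names are illustrative assumptions.

```python
import pandas as pd

def eligible_models(submissions: pd.DataFrame, last_3_weeks) -> pd.Index:
    """Models with 5+ distinct submission weeks overall, or submissions
    in at least 2 of the last 3 evaluated weeks.

    `submissions` is assumed to have one row per weekly forecast, with
    hypothetical columns `model` and `week`.
    """
    total_weeks = submissions.groupby("model")["week"].nunique()
    recent_weeks = (
        submissions[submissions["week"].isin(last_3_weeks)]
        .groupby("model")["week"].nunique()
        .reindex(total_weeks.index, fill_value=0)
    )
    return total_weeks[(total_weeks >= 5) | (recent_weeks >= 2)].index
```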
This figure shows the number of incident cases reported each week. The period between the vertical lines indicates the weeks for which models were evaluated.
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
For the first two figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states, for submission weeks beginning the first week in April, at a 1-week horizon. The second figure shows the mean WIS aggregated across locations in the same way, but at a 4-week horizon.
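The aggregation behind these figures amounts to averaging per-location WIS values within each submission week and horizon. Below is a minimal pandas sketch, assuming a hypothetical long-format `wis_scores` table rather than the Hub's actual data files.

```python
import pandas as pd

def mean_wis_by_week(wis_scores: pd.DataFrame) -> pd.DataFrame:
    """Average per-location WIS within each submission week and horizon.

    `wis_scores` is a hypothetical long-format table with one row per
    (model, location, week, horizon) combination and a `wis` score
    column, restricted to models covering all 50 states.
    """
    return (
        wis_scores[wis_scores["horizon"].isin([1, 4])]
        .groupby(["model", "week", "horizon"])["wis"]
        .mean()                 # average across the 50 locations
        .unstack("horizon")     # one column per horizon, ready to plot
    )
```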
To view a specific team, double-click on the team name in the legend. To view a value on the plot, click on the point in the forecast of interest. To view a specific time of interest, highlight that section on the graph or use the zoom functionality.
In this figure, the dotted black line represents the average 1-week-ahead error. Error is often larger at the 4-week horizon than at the 1-week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. Coverage error is typically larger at the 4-week horizon than at the 1-week horizon.
The LeaderBoard table included below evaluates forecasts of incident deaths using the same metrics and column-by-column inclusion criteria described above for incident cases.
This plot shows the observed number of incident deaths over the evaluation period.
In the following figures, we have evaluated models across multiple forecasting weeks. The models included in this comparison must have submitted forecasts for all 50 states and at the national level for each time point.
For the first two figures, WIS is used as the metric. The first figure shows the mean WIS across all locations for each submission week at a 1-week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4-week horizon.
To view a specific team, double-click on the team name in the legend. To view a value on the plot, click on the point in the forecast of interest. To view a specific time of interest, highlight that section on the graph or use the zoom functionality.
In this figure, the dotted black line represents the average 1-week-ahead error. There is larger variation in error at the 4-week horizon than at the 1-week horizon.
The black line represents the 95% coverage level we would expect from a well-calibrated model.
The black line represents the 95% coverage level we would expect from a well-calibrated model; as above, deviations are typically larger at the 4-week horizon than at the 1-week horizon.